09/873,719 



MS158346.01/MSFTP184US 



Amendments to the Claims 

This listing of claims will replace all prior versions, and listings, of claims in the 
application: 

1. (Currently amended) A computer implemented system that facilitates
building a statistical model for a computer readable data set, comprising: 

a first training algorithm that efficiently builds a rough model from a 
subset of the computer readable data set; 

an evaluation component that determines whether the subset of the 
computer readable data set is an appropriate subset to build a model for the computer 
readable data set; [[and]] 

a second training algorithm that builds a refined model for the computer 
readable data set from the subset if deemed appropriate by the evaluation component, the
refined model discovers good clustering of data for a fixed number of clusters; and 

a data scheduler that, based on a data policy, adaptively controls the size 
of subsets for which the first algorithm is applied.

2. (Cancelled) 

3. (Currently amended) The system of claim 1 [[2]], wherein the data 
scheduler increases the size of the subset to provide a larger aggregate subset of the data 
set if the rough model is unacceptable, the first training algorithm efficiently builds the 
rough model for each larger aggregate subset of the data until the evaluation component 
determines the resulting rough model to be acceptable. 
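
By way of illustration only, and without limiting the claims, the loop recited in claims 1 and 3 might be sketched in Python as follows. A data scheduler enlarges the subset until a rough model is judged acceptable, and a second algorithm then refines the model on that subset. The names `geometric_schedule` and `learning_curve_fit`, the callables passed to them, and the geometric growth policy are all assumptions made for this sketch; the claims do not fix a particular data policy or model type.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def geometric_schedule(total: int, first: int, growth: float = 2.0) -> List[int]:
    """Hypothetical data policy: subset sizes grow geometrically, capped at `total`."""
    sizes: List[int] = []
    size = first
    while size < total:
        sizes.append(size)
        size = max(size + 1, int(size * growth))
    sizes.append(total)
    return sizes


def learning_curve_fit(
    data: Sequence[float],
    sizes: Sequence[int],
    rough_fit: Callable[[Sequence[float]], float],
    refine_fit: Callable[[Sequence[float]], float],
    is_acceptable: Callable[[float, float], bool],
) -> Tuple[float, int]:
    """Grow the subset until the rough model is acceptable, then refine once.

    `is_acceptable(previous, current)` compares consecutive rough models; the
    first subset is never accepted because there is nothing to compare it to.
    """
    previous: Optional[float] = None
    subset: Sequence[float] = data[: sizes[0]]
    for size in sizes:
        subset = data[:size]
        rough = rough_fit(subset)          # cheap first training algorithm
        if previous is not None and is_acceptable(previous, rough):
            break                          # evaluation component accepts the subset
        previous = rough
    return refine_fit(subset), len(subset)  # costlier second training algorithm


if __name__ == "__main__":
    data = [float(i % 7) for i in range(1000)]
    sizes = geometric_schedule(len(data), first=50)
    model, used = learning_curve_fit(
        data,
        sizes,
        rough_fit=lambda s: sum(s[::4]) / len(s[::4]),  # toy rough model: strided mean
        refine_fit=lambda s: sum(s) / len(s),           # toy refined model: full mean
        is_acceptable=lambda prev, cur: abs(cur - prev) < 0.05,
    )
    print(sizes, used, model)
```

In this sketch the acceptance test is a simple tolerance on successive rough models; a cost-benefit test of the kind recited in claim 4 could be passed in as `is_acceptable` instead.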

4. (Currently amended) The system of claim 3, wherein the acceptability of 
each rough model is determined based on a stopping criterion functionally related to an 
expected incremental benefit and a cost associated with increasing the size of the 
aggregate subset of the data set. 

5. (Currently amended) The system of claim 4, wherein the cost of the
stopping criterion is functionally related to at least one of time associated with evaluating 
an aggregate data subset of increased size and size of the aggregated subset of the data. 

6. (Currently amended) The system of claim 4, wherein the stopping
criterion is defined by

$$\frac{l(D_{HO} \mid \theta(D_n)) - l(D_{HO} \mid \theta(D_{n-1}))}{l(D_{HO} \mid \theta(D_n)) - l(D_{HO} \mid \theta_{BASE}(D_n))} \cdot \frac{1}{c_1 (I_1 - J_n)\,|\Delta D_{n+1}| + c_2 (I_1 - J_n) + c_1 J_n\,|D_{n+1}| + c_2 J_n + c_3} < \lambda$$

where

$l(D_{HO} \mid \theta(D_n))$ is a log likelihood for holdout data evaluated for the model
built by the first training algorithm on a current subset of the training data set,

$l(D_{HO} \mid \theta(D_{n-1}))$ is a log likelihood for holdout data evaluated for the model
built by the first training algorithm on a previous subset of the training data set,

$l(D_{HO} \mid \theta_{BASE}(D_n))$ is a log likelihood for holdout data evaluated for a base
model,

$c_1$, $c_2$, and $c_3$ are constants determined based on application of the second
training algorithm relative to a first subset of the data set,

$I_1$ is a number of iterations for the second training algorithm, when applied
to the first subset,

$J_n = \frac{1}{n} \sum_{i=1}^{n} J_i$, and

$J_i$ is the number of iterations for the first training algorithm when applied
to a data subset $D_i$,

$|D_{n+1}|$ is the size of data set $D_{n+1}$,

$|\Delta D_{n+1}|$ is the increment in size $|D_{n+1}| - |D_n|$,

$\lambda$ is a user determined stopping threshold.

7. (Currently amended) The system of claim 4, wherein the stopping
criterion is defined by

$$\frac{l(D_{HO} \mid \theta(D_n)) - l(D_{HO} \mid \theta(D_{n-1}))}{l(D_{HO} \mid \theta(D_n)) + \delta - l(D_{HO} \mid \theta_{BASE}(D_n))} \cdot \frac{1}{c_1 (I_1 - J_n)\,|\Delta D_{n+1}| + c_2 (I_1 - J_n) + c_1 J_n\,|D_{n+1}| + c_2 J_n + c_3} < \lambda$$

where

$l(D_{HO} \mid \theta(D_n))$ is a log likelihood for holdout data evaluated for the model
built by the first training algorithm on a current subset of the training data set,

$l(D_{HO} \mid \theta(D_{n-1}))$ is a log likelihood for holdout data evaluated for the model
built by the first training algorithm on a previous subset of the training data set,

$l(D_{HO} \mid \theta_{BASE}(D_n))$ is a log likelihood for holdout data evaluated for a base
model,

$\delta$ is an offset associated with a difference in log likelihood for holdout
data when evaluated for models built on a first subset of the training data set by
the respective first and second training algorithms,

$c_1$, $c_2$, and $c_3$ are constants determined based on application of the second
training algorithm relative to a first subset of the data set,

$I_1$ is a number of iterations for the second training algorithm, when applied
to the first subset,

$J_n = \frac{1}{n} \sum_{i=1}^{n} J_i$, and

$J_i$ is the number of iterations for the first training algorithm when applied
to a data subset $D_i$,

$|D_{n+1}|$ is the size of data set $D_{n+1}$,

$|\Delta D_{n+1}|$ is the increment in size $|D_{n+1}| - |D_n|$, and

$\lambda$ is a user determined stopping threshold.
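
By way of illustration only, the stopping criterion of claims 6 and 7 might be evaluated as in the Python sketch below. It assumes the criterion is read as a ratio of two quantities compared with the user threshold: the relative gain in holdout log likelihood, and a projected cost built from the constants c1, c2, c3 and the iteration counts. The function and parameter names are hypothetical, and the `offset` argument (zero by default) stands in for the offset term of claim 7.

```python
from typing import Sequence


def stopping_ratio(
    ll_current: float,          # l(D_HO | theta(D_n))
    ll_previous: float,         # l(D_HO | theta(D_{n-1}))
    ll_base: float,             # l(D_HO | theta_BASE(D_n))
    refine_iters_first: float,  # I_1
    rough_iters: Sequence[int],  # J_1 .. J_n
    size_current: int,          # |D_n|
    size_next: int,             # |D_{n+1}|
    c1: float,
    c2: float,
    c3: float,
    offset: float = 0.0,        # offset of claim 7; 0 gives the claim 6 form
) -> float:
    """Relative benefit of the last increment divided by the projected cost."""
    j_n = sum(rough_iters) / len(rough_iters)   # J_n, mean rough-iteration count
    delta_size = size_next - size_current       # |Delta D_{n+1}|
    benefit = (ll_current - ll_previous) / (ll_current + offset - ll_base)
    cost = (
        c1 * (refine_iters_first - j_n) * delta_size
        + c2 * (refine_iters_first - j_n)
        + c1 * j_n * size_next
        + c2 * j_n
        + c3
    )
    return benefit / cost


def should_stop(ratio: float, threshold: float) -> bool:
    """Stop growing the subset once the ratio falls below the user threshold."""
    return ratio < threshold
```

For example, with holdout log likelihoods of -100 (current), -110 (previous) and -200 (base), the relative benefit is 10/100 = 0.1. With I_1 = 10, rough iteration counts [2, 4], subset sizes 100 and 200, c1 = 0.5, c2 = 1 and c3 = 5, the cost term is 350 + 7 + 300 + 3 + 5 = 665.
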

8. (Currently amended) The system of claim 1, wherein the first training
algorithm further comprises an iterative algorithm, which builds the rough model for the 
subset of the data set according to an associated training policy. 

9. (Currently amended) The system of claim 8, wherein the first training 
algorithm further comprises an associated training policy that defines parameter 
initialization of the first training algorithm for each subset of the data set. 

10. (Currently amended) The system of claim 9, wherein the training policy 
associated with the first training algorithm further controls parameter initialization of the 
first training algorithm, such that at least some of the parameters computed for a previous 
subset of the data are employed to initialize the first training algorithm for a subsequent 
larger aggregate subset of the data. 

11. (Currently amended) The system of claim 9, wherein the first training
algorithm is initialized by the same parameter values for each subset of the data subset. 
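
By way of illustration only, the two parameter initialization policies of claims 10 and 11 can be contrasted with a small Python sketch. A one-dimensional k-means routine is used here purely as a stand-in iterative algorithm; it is an assumption of this sketch, not the training algorithm of the claims, and the names `kmeans_1d` and `fit_over_subsets` are hypothetical.

```python
from typing import List, Sequence, Tuple


def kmeans_1d(
    points: Sequence[float], centers: Sequence[float], max_iters: int
) -> Tuple[List[float], int]:
    """Stand-in iterative algorithm: returns (centers, iterations performed)."""
    centers = list(centers)
    iterations = 0
    for _ in range(max_iters):
        groups: List[List[float]] = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda k: abs(x - centers[k]))
            groups[nearest].append(x)
        # Empty clusters keep their old center.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        iterations += 1
        if updated == centers:   # no change: converged
            break
        centers = updated
    return centers, iterations


def fit_over_subsets(
    data: Sequence[float],
    sizes: Sequence[int],
    initial: Sequence[float],
    reuse_previous: bool,
) -> Tuple[List[float], List[int]]:
    """Fit growing prefixes of `data` under one of two initialization policies."""
    params = list(initial)
    iteration_counts: List[int] = []
    for size in sizes:
        # Claim 10 style: start from the previous subset's parameters.
        # Claim 11 style: start from the same values for every subset.
        start = params if reuse_previous else initial
        params, used = kmeans_1d(data[:size], start, max_iters=50)
        iteration_counts.append(used)
    return params, iteration_counts


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0] * 4
    print(fit_over_subsets(data, [6, 12, 24], [0.0, 5.0], reuse_previous=True))
    print(fit_over_subsets(data, [6, 12, 24], [0.0, 5.0], reuse_previous=False))
```

On this toy data both policies reach centers of 2.0 and 12.0. Reusing the previous parameters takes 3, 1 and 1 iterations over the three subsets, while restarting from the same values takes 3 iterations each time.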

12. (Currently amended) The system of claim 9, wherein the training policy 
sets the iterative algorithm to perform a fixed number of at least one iteration. 

13. (Currently amended) The system of claim 12, wherein the training policy
sets the iterative algorithm to perform a single iteration. 

14. (Currently amended) The system of claim 12, wherein the second training 
algorithm further comprises an iterative algorithm that operates according to an 
associated training policy, so as to produce a more accurate model for the appropriate 
subset of the data set than the first training algorithm. 

15. (Currently amended) The system of claim 14, wherein the iterative 
algorithm associated with at least one of the first and second training algorithms is an 
Expectation and Maximization algorithm. 

16. (Currently amended) The system of claim 8, wherein the training policy
associated with the iterative algorithm of the first training algorithm controls the iterative 
algorithm to run until an associated convergence criterion is satisfied. 

17. (Currently amended) The system of claim 16, wherein the second training
algorithm further comprises an iterative algorithm, which builds the refined model for the 
appropriate subset of the data set according to an associated training policy. 

18. (Currently amended) The system of claim 17, wherein the training policy 
associated with the iterative algorithm of the second training algorithm controls the 
respective iterative algorithm to run until an associated convergence criterion is satisfied, 
wherein the convergence criterion associated with the second training algorithm provides 
improved model quality relative to the convergence criterion associated with the first 
training algorithm. 

19. (Currently amended) A computer implemented system programmed to 
facilitate building a statistical model, comprising: 

a first parameter estimation algorithm that efficiently builds a rough model 
from a subset of a computer readable data set based on a training policy associated 
therewith; and 

an evaluation component that determines whether the subset of data from 
which the rough model was built is an appropriate size for building the statistical model 
to characterize the data set, the evaluation component utilizes a stopping criterion that is
functionally related to an expected incremental benefit and an expected incremental cost
associated with increasing the size of the subset of data;

a second parameter estimation algorithm that builds a refined model for 
the data set from the subset if determined to have the appropriate size, the second 
parameter estimation algorithm having an associated training policy, which enables the 
second parameter estimation algorithm to build a more accurate model than the first 
parameter estimation algorithm, the more accurate model employed to identify clusters of
data within the computer readable data set.

20. (Previously presented) The system of claim 19, further comprising a data
scheduler that increases the size of the subset of the data set to provide a larger aggregate 
subset of the data set if the rough model is unacceptable, the first parameter estimation 
algorithm efficiently builds a rough model for each larger aggregate subset until a 
resulting rough model built therefrom is determined to be acceptable. 

21. (Currently amended) The system of claim 19, wherein the first parameter
estimation algorithm further comprises an iterative algorithm that builds the rough model 
for each subset of the data set according to the associated training policy. 

22. (Currently amended) The system of claim 21, wherein the training policy
for the first parameter estimation algorithm is operative to control parameter initialization 
for the first parameter estimation algorithm, such that at least some of the parameters 
computed for a previous subset of the data are employed to initialize the first parameter 
estimation algorithm for a subsequent larger aggregate subset of the data set. 

23. (Currently amended) The system of claim 21, wherein the first parameter
estimation algorithm is initialized by the same parameter values for each subset of the 
data subset. 

24. (Currently amended) The system of claim 21, wherein the training policy
associated with first parameter estimation algorithm controls the iterative algorithm of the 
first parameter estimation algorithm to perform a fixed number of at least one iteration, 
the second training algorithm further comprising an iterative algorithm, which is 
operative to perform a greater number of iterations than the iterative algorithm of the first 
training algorithm based on a training policy associated with the second parameter 
estimation algorithm. 

25. (Currently amended) The system of claim 21, wherein the training policy
associated with the iterative algorithm of the first parameter estimation algorithm controls 
the iterative algorithm to run until an associated convergence threshold is satisfied, 
wherein the second training algorithm further comprises an iterative algorithm, the 
training policy associated with the iterative algorithm of the second parameter estimation 
algorithm being operative to control the respective iterative algorithm to run until an 
associated convergence threshold is satisfied, the convergence threshold associated with 
the second parameter estimation algorithm is less than the convergence threshold 
associated with the first parameter estimation algorithm. 
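
By way of illustration only, the relationship between the two convergence thresholds of claim 25 might be sketched in Python as follows. The sketch uses an Expectation and Maximization routine for a one-dimensional mixture of Gaussians with equal weights and unit variance, in which only the means are updated. This restricted model, the data and the names `log_likelihood` and `em_means` are assumptions chosen to keep the example short. The same routine is run once with a loose threshold (rough model) and once with a smaller threshold (refined model).

```python
import math
from typing import List, Sequence, Tuple


def log_likelihood(points: Sequence[float], means: Sequence[float]) -> float:
    """Log likelihood under an equal-weight, unit-variance Gaussian mixture."""
    weight = 1.0 / len(means)
    norm = math.sqrt(2.0 * math.pi)
    total = 0.0
    for x in points:
        density = sum(weight * math.exp(-0.5 * (x - m) ** 2) / norm for m in means)
        total += math.log(density)
    return total


def em_means(
    points: Sequence[float],
    means: Sequence[float],
    threshold: float,
    max_iters: int = 1000,
) -> Tuple[List[float], int, float]:
    """Run EM until the log-likelihood gain drops below `threshold`.

    Returns (means, iterations performed, final log likelihood).
    """
    means = list(means)
    previous = log_likelihood(points, means)
    for iteration in range(1, max_iters + 1):
        sums = [0.0] * len(means)
        weights = [0.0] * len(means)
        for x in points:
            # E-step: responsibilities (equal weights and variances cancel).
            resp = [math.exp(-0.5 * (x - m) ** 2) for m in means]
            norm = sum(resp)
            for k, r in enumerate(resp):
                weights[k] += r / norm
                sums[k] += (r / norm) * x
        # M-step: responsibility-weighted means.
        means = [s / w for s, w in zip(sums, weights)]
        current = log_likelihood(points, means)
        if current - previous < threshold:
            return means, iteration, current
        previous = current
    return means, max_iters, previous


if __name__ == "__main__":
    points = [0.0, 0.5, 1.0, 1.5, 4.0, 4.5, 5.0, 5.5]
    rough = em_means(points, [1.0, 2.0], threshold=1e-1)    # first, looser threshold
    refined = em_means(points, [1.0, 2.0], threshold=1e-8)  # second, smaller threshold
    print(rough)
    print(refined)
```

Because both runs follow the same sequence of EM updates, the run with the smaller threshold performs at least as many iterations as the run with the looser one, and its final log likelihood is at least as high.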

26. (Cancelled) 

27. (Currently amended) The system of claim 19 [[26]], wherein the cost of
the stopping criterion is functionally related to at least one of time associated with 
evaluating the model for a larger subset of data and size of the larger subset of the data. 

28. (Currently amended) The system of claim 19 [[26]], wherein the stopping
criterion is defined by

$$\frac{l(D_{HO} \mid \theta(D_n)) - l(D_{HO} \mid \theta(D_{n-1}))}{l(D_{HO} \mid \theta(D_n)) - l(D_{HO} \mid \theta_{BASE}(D_n))} \cdot \frac{1}{c_1 (I_1 - J_n)\,|\Delta D_{n+1}| + c_2 (I_1 - J_n) + c_1 J_n\,|D_{n+1}| + c_2 J_n + c_3} < \lambda$$

where

$l(D_{HO} \mid \theta(D_n))$ is a log likelihood for holdout data evaluated for the model
built by the first training algorithm on a current subset of the training data set,

$l(D_{HO} \mid \theta(D_{n-1}))$ is a log likelihood for holdout data evaluated for the model
built by the first training algorithm on a previous subset of the training data set,

$l(D_{HO} \mid \theta_{BASE}(D_n))$ is a log likelihood for holdout data evaluated for a base
model,

$c_1$, $c_2$, and $c_3$ are constants determined based on application of the second
parameter estimation algorithm relative to a first subset of the data set,

$I_1$ is a number of iterations for the second parameter estimation algorithm,
when applied to the first subset,

$J_n = \frac{1}{n} \sum_{i=1}^{n} J_i$, and

$J_i$ is the number of iterations for the first parameter estimation algorithm
when applied to a data subset $D_i$,

$|D_{n+1}|$ is the size of data set $D_{n+1}$,

$|\Delta D_{n+1}|$ is the increment in size $|D_{n+1}| - |D_n|$, and

$\lambda$ is a user determined stopping threshold.

29. (Currently amended) The system of claim 19 [[26]], wherein the stopping
criterion is defined by

$$\frac{l(D_{HO} \mid \theta(D_n)) - l(D_{HO} \mid \theta(D_{n-1}))}{l(D_{HO} \mid \theta(D_n)) + \delta - l(D_{HO} \mid \theta_{BASE}(D_n))} \cdot \frac{1}{c_1 (I_1 - J_n)\,|\Delta D_{n+1}| + c_2 (I_1 - J_n) + c_1 J_n\,|D_{n+1}| + c_2 J_n + c_3} < \lambda$$

where

$l(D_{HO} \mid \theta(D_n))$ is a log likelihood for holdout data evaluated for the model
built by the first training algorithm on a current subset of the training data set,

$l(D_{HO} \mid \theta(D_{n-1}))$ is a log likelihood for holdout data evaluated for the model
built by the first training algorithm on a previous subset of the training data set,

$l(D_{HO} \mid \theta_{BASE}(D_n))$ is a log likelihood for holdout data evaluated for a base
model,

$\delta$ is an offset associated with a difference in log likelihood for holdout
data when evaluated for models built on a first subset of the training data set by
the respective first and second training algorithms,

$c_1$, $c_2$, and $c_3$ are constants determined based on application of the second
parameter estimation algorithm relative to a first data subset of the data set,

$I_1$ is a number of iterations for the second parameter estimation algorithm,
when applied to a first data subset,

$J_n = \frac{1}{n} \sum_{i=1}^{n} J_i$, and

$J_i$ is the number of iterations for the first parameter estimation algorithm
when applied to a data subset $D_i$,

$|D_{n+1}|$ is the size of data set $D_{n+1}$,

$|\Delta D_{n+1}|$ is the increment in size $|D_{n+1}| - |D_n|$, and

$\lambda$ is a user determined stopping threshold.

30. (Currently amended) A computer implemented learning curve method to
facilitate building a statistical model, comprising: 

choosing a subset of a computer readable data set; 

employing a first training algorithm to build a rough model to characterize
the subset;

evaluating the rough model; 

if the rough model is unacceptable, repeatedly increasing the size of the 
subset of data to provide an aggregate data set, building another rough model to 
characterize the aggregate subset, and reevaluating the model, the acceptability of each
rough model based on a stopping criterion functionally related to an expected incremental
benefit and an expected incremental cost associated with increasing the size of the
aggregate subset; and

if the model is acceptable, employing a second training algorithm to build 
a refined model based on the aggregate data set, the second training algorithm being 
different from the first training algorithm, the refined model identifies data clusters
contained in the computer readable data set.

31. (Cancelled) 

32. (Currently amended) The system of claim [[31]] 30, wherein the cost of 
the stopping criterion is functionally related to at least one of time associated with 
evaluating an aggregate data subset of increased size and size of the aggregate subset of 
the data. 

33. (Currently amended) The system of claim [[31]] 30, wherein the stopping
criterion is defined by

$$\frac{l(D_{HO} \mid \theta(D_n)) - l(D_{HO} \mid \theta(D_{n-1}))}{l(D_{HO} \mid \theta(D_n)) - l(D_{HO} \mid \theta_{BASE}(D_n))} \cdot \frac{1}{c_1 (I_1 - J_n)\,|\Delta D_{n+1}| + c_2 (I_1 - J_n) + c_1 J_n\,|D_{n+1}| + c_2 J_n + c_3} < \lambda$$

where

$l(D_{HO} \mid \theta(D_n))$ is a log likelihood for holdout data evaluated for the model
built by the first training algorithm on a current subset of the training data set,

$l(D_{HO} \mid \theta(D_{n-1}))$ is a log likelihood for holdout data evaluated for the model
built by the first training algorithm on a previous subset of the training data set,

$l(D_{HO} \mid \theta_{BASE}(D_n))$ is a log likelihood for holdout data evaluated for a base
model,

$c_1$, $c_2$, and $c_3$ are constants determined based on application of the second
parameter estimation algorithm relative to a first subset of the data set,

$I_1$ is a number of iterations for the second parameter estimation algorithm,
when applied to the first subset,

$J_n = \frac{1}{n} \sum_{i=1}^{n} J_i$, and

$J_i$ is a number of iterations for the first parameter estimation algorithm
when applied to a data subset $D_i$,

$|D_{n+1}|$ is a size of data set $D_{n+1}$,

$|\Delta D_{n+1}|$ is an increment in size $|D_{n+1}| - |D_n|$, and

$\lambda$ is a user determined stopping threshold.

34. (Currently amended) The system of claim [[31]] 30, wherein the stopping
criterion is defined by

$$\frac{l(D_{HO} \mid \theta(D_n)) - l(D_{HO} \mid \theta(D_{n-1}))}{l(D_{HO} \mid \theta(D_n)) + \delta - l(D_{HO} \mid \theta_{BASE}(D_n))} \cdot \frac{1}{c_1 (I_1 - J_n)\,|\Delta D_{n+1}| + c_2 (I_1 - J_n) + c_1 J_n\,|D_{n+1}| + c_2 J_n + c_3} < \lambda$$

where

$l(D_{HO} \mid \theta(D_n))$ is a log likelihood for holdout data evaluated for the model
built by the first training algorithm on a current subset of the training data set,

$l(D_{HO} \mid \theta(D_{n-1}))$ is a log likelihood for holdout data evaluated for the model
built by the first training algorithm on a previous subset of the training data set,

$l(D_{HO} \mid \theta_{BASE}(D_n))$ is a log likelihood for holdout data evaluated for a base
model,

$\delta$ is an offset associated with the difference in log likelihood for holdout
data when evaluated for models built on a first subset of the training data set by
the respective first and second training algorithms,

$c_1$, $c_2$, and $c_3$ are constants determined based on application of the second
parameter estimation algorithm relative to a first data subset of the data set,

$I_1$ is a number of iterations for the second parameter estimation algorithm,
when applied to a first data subset,

$J_n = \frac{1}{n} \sum_{i=1}^{n} J_i$, and

$J_i$ is a number of iterations for the first parameter estimation algorithm
when applied to a data subset $D_i$,

$|D_{n+1}|$ is a size of data set $D_{n+1}$,

$|\Delta D_{n+1}|$ is an increment in size $|D_{n+1}| - |D_n|$, and

$\lambda$ is a user determined stopping threshold.

35. (Currently amended) The method of claim 30, wherein the first training 
algorithm is more computationally efficient than the second training algorithm. 

36. (Currently amended) The method of claim 30, wherein each instance of
model building repeated until obtaining an acceptable model by the first training 
algorithm employs more efficient and less accurate model building than model building 
employed by the second training algorithm that occurs after obtaining the acceptable 
model. 

37. (Currently amended) The method of claim 36, wherein each instance of 
model building repeated until obtaining an acceptable model employs the first training 
algorithm as an iterative algorithm that is run to a first convergence criterion, the second 
training algorithm employing an iterative algorithm that is run to a second convergence 
criterion, which demands more iterations than the first convergence criterion in order to 
obtain convergence, so that the refined model is more accurate than the rough model built 
by the first training algorithm. 

38. (Currently amended) The method of claim 36, wherein each instance of 
model building repeated until obtaining an acceptable model employs an iterative 
algorithm having a fixed number of at least one iteration, the second training algorithm 
employing an iterative algorithm having a greater number of iterations than the fixed 
number. 

39. (Original) The method of claim 30, further comprising controlling 
parameter initialization employed in each instance of building a model for the aggregate 
data set prior to obtaining an acceptable model. 

40. (Original) The method of claim 39, further comprising initializing the first 
training algorithm by the same parameter values for each subset. 

41. (Currently amended) The method of claim 39, wherein the controlling
further comprises reusing at least some of the parameters computed from a previous 
instance of model building to initialize a subsequent instance of model building for a 
subsequent larger aggregate data set prior to obtaining an acceptable model. 

42. (Currently amended) A computer-readable medium having
computer-executable instructions for:

choosing a subset of a computer readable data set; 
building a rough model to characterize the subset based on an associated 
training policy; 

evaluating the rough model; 

if the rough model is unacceptable, repeatedly increasing the size of the 
subset of data to provide an aggregate data set, building a rough model to characterize the 
aggregate subset based on an associated training policy, and reevaluating the rough 
model; [[and]] 

building a refined model for the computer readable data set from the 
aggregate data set if the rough model is determined to be acceptable based on an 
associated training policy that includes determining acceptability based on an expected 
incremental benefit relative to an expected incremental cost associated with increasing 
the size of the aggregate data set; and 

utilizing the refined model to identify identifiable clusters in the computer 
readable data set.

43. (Cancelled) 




44. (Currently amended) A computer implemented method to facilitate 
constructing a statistical model, comprising: 

separating computer readable data into holdout data and training data; 

determining a data subset from the training data by estimating model 
parameters according to a first training policy and evaluating the estimated model 
parameters relative to the holdout data set and repeating the estimation and evaluation of 
model parameters with a larger subset of the training data until an acceptable quality of 
the estimated model is established; [[and,]] 

controlling parameter initialization employed in each estimation of model 
parameters repeatedly until an acceptable size for the determined data subset is achieved; 
and 

subsequent to establishing the acceptable quality of the estimated model, 
using the determined data subset to improve the estimated model parameters by 
employing a second training policy that is more accurate than the first training policy, the
estimated model parameters obtained from employment of the second training policy
utilized to characterize at least one cluster within the computer readable data.
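
By way of illustration only, the separation into holdout and training data and the evaluation of estimated parameters against the holdout data, as recited in claim 44, might be sketched in Python as follows. A single Gaussian is used as a toy model, and the names `split_holdout`, `fit_gaussian` and `holdout_log_likelihood`, the 20% holdout fraction and the fixed seed are assumptions of this sketch.

```python
import math
import random
from typing import List, Sequence, Tuple


def split_holdout(
    data: Sequence[float], holdout_fraction: float, seed: int = 0
) -> Tuple[List[float], List[float]]:
    """Shuffle a copy of `data` and split it into (holdout, training)."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * holdout_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_gaussian(subset: Sequence[float]) -> Tuple[float, float]:
    """Toy parameter estimation: maximum-likelihood mean and variance."""
    mean = sum(subset) / len(subset)
    variance = sum((x - mean) ** 2 for x in subset) / len(subset)
    return mean, max(variance, 1e-9)   # floor avoids a zero variance


def holdout_log_likelihood(
    holdout: Sequence[float], mean: float, variance: float
) -> float:
    """Gaussian log likelihood of the holdout data under the estimated parameters."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * variance) - (x - mean) ** 2 / (2.0 * variance)
        for x in holdout
    )


if __name__ == "__main__":
    data = [float(i % 10) for i in range(500)]
    holdout, training = split_holdout(data, holdout_fraction=0.2)
    for size in (25, 50, 100, 200, 400):
        mean, variance = fit_gaussian(training[:size])
        score = holdout_log_likelihood(holdout, mean, variance)
        print(size, round(score, 3))
```

The holdout score printed for each larger training subset is the quantity that an acceptability test would compare from one subset to the next.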

45. (Currently amended) The method of claim 44, wherein each estimation of 
model parameters repeated until the acceptable quality of the estimated model is 
established further comprises employing an iterative algorithm that is run until a first 
convergence criterion is satisfied, the estimation of model parameters using the 
determined data subset further comprising an iterative algorithm that is run until a second 
convergence criterion is satisfied, which is operative to provide a better quality of model 
than the first convergence criterion. 

46. (Currently amended) The system of claim 45, wherein the first 
convergence criterion causes the associated iterative algorithm to run until a first 
convergence threshold is satisfied, wherein the second convergence criterion causes the 
associated iterative algorithm to run until a second convergence threshold is satisfied, the 
second convergence threshold being less than the first convergence threshold. 

47. (Currently amended) The method of claim 45, wherein at least one of the
iterative algorithm run to the first convergence criterion and the iterative algorithm run to 
the second convergence criterion is an Expectation and Maximization algorithm. 

48. (Currently amended) The method of claim 44, wherein each estimation of 
model parameters repeated until the acceptable quality of the estimated model is 
established employs an iterative algorithm having a fixed number of at least one iteration, 
the estimation of model parameters using the determined data subset further employing 
an iterative algorithm having a greater number of iterations than the fixed number. 

49. (Cancelled) 

50. (Currently amended) The method of claim 44, wherein the controlling 
further comprises reusing at least some of the parameters computed from a previous 
estimation of model parameters to initialize a subsequent estimation of model parameters 
for a next larger subset of the training set. 

51. (Currently amended) The method of claim 44, wherein each estimation of
model parameters repeated until the acceptable quality of the estimated model is 
established further comprises initializing the first training algorithm by the same 
parameter values. 

52. (Original) The method of claim 44, further comprising determining the 
acceptability of the estimated model based on an expected incremental benefit relative to 
a cost associated with increasing the size of the subset of the data set. 

53. (Currently amended) A computer-readable medium having
computer-executable instructions for:

separating computer readable data into holdout data and training data; 

determining a data subset from the training data by estimating model 
parameters and controlling model parameter initialization, according to a first training 
policy and evaluating the estimated model parameters relative to the holdout data set and 
repeating the estimation, initialization, and evaluation of model parameters with a next
successively larger subset of the training data set until an acceptable quality of the 
estimated model is established; [[and]] 

subsequent to establishing the acceptable quality of the estimated model, 
using the determined data subset to improve the estimated model parameters by 
employing a second training policy that is more accurate than the first training policy; 
and 

utilizing the estimated model parameters determined by utilization of the 
second training policy to identify a cluster in the computer readable data.

54. (Currently amended) A computer implemented method to facilitate
constructing a statistical model, comprising: 

separating computer readable data into a holdout data set and a training
data set;

iteratively estimating model parameters for a subset of the training data set 
over a fixed number of iterations and evaluating the estimated model parameters relative 
to the holdout data set; 

repeating the estimation and evaluation of model parameters obtained with 
successively larger subsets of the training data set until an acceptable model quality is 
established, acceptable model quality determined based on an expected incremental
benefit relative to an expected incremental detriment associated with an increase in size
of each larger training subset of the data set; [[and]]

after the acceptable model quality is established, iteratively estimating 
model parameters for the data subset, which provided the acceptable model quality, until 
a better quality of model is provided relative to a preceding estimation performed over 
the fixed number of iterations; and

using the better quality model relative to the computer readable data to 
identify at least a cluster of data within the computer readable data.

55. (Currently amended) The method of claim 54, wherein at least one of the 
iterative estimations employs an Expectation and Maximization algorithm. 

56. (Currently amended) The method of claim 54, wherein the estimation that 
occurs after the acceptable model quality is established, further comprises employing an 
iterative algorithm having a greater number of iterations than the fixed number. 

57. (Currently amended) The method of claim 54, wherein the estimation of 
model parameters after the acceptable model quality has been established further 
comprises employing an iterative algorithm that is run until a convergence criterion is 
satisfied, which is operative to provide a better quality of model with the data subset than 
a preceding estimation employing the fixed number of iterations. 

58. (Original) The method of claim 54, further comprising controlling
parameter initialization for each estimation of model parameters that occurs before the 
acceptable model quality has been established. 

59. (Currently amended) The method of claim 58, wherein each iterative
estimation until the acceptable model quality is established further comprises initializing 
the first training algorithm by the same parameter values. 

60. (Currently amended) The method of claim 58, wherein the controlling
further comprises reusing at least some of the parameters obtained in a previous 
estimation of model parameters to initialize a subsequent estimation of model parameters 
for a next larger subset of the training data set. 

61. (Cancelled) 

62. (Currently amended) A computer implemented method to facilitate
constructing a statistical model, comprising: 

separating computer readable data into a holdout data set and a training
data set;

iteratively estimating model parameters for a subset of the training data set 
until a first convergence threshold is satisfied and evaluating the estimated model 
parameters relative to the holdout data set; 

repeating the estimation and evaluation of model parameters obtained with 
successively larger subsets of the training data set until determining a size of data subset 
that provides acceptable model parameters, acceptable model parameters attained where
the expected marginal cost outweighs the expected marginal benefit associated with
successively larger subsets; [[and]]

after determining the size of data subset that provides acceptable model 
parameters, iteratively estimating model parameters for a data subset of the acceptable 
size until a second convergence threshold is satisfied, the second convergence threshold 
being less than the first convergence threshold; and

based at least on the estimated model parameters identified at the second 
convergence threshold, identifying a good clustering of data relative to the computer 
readable data.

63. (Currently amended) A computer implemented system to facilitate
building a statistical model for a computer readable data set, comprising: 

first means for building a rough model to characterize a subset of the 
computer readable data set; 

evaluation means for evaluating the acceptability of the rough model 
based at least in part on an expectational cost-benefit analysis, the first means building
another rough model for a larger subset of the data if the evaluation means determines 
that a prior rough model is unacceptable; [[and]] 

second means, which is different from the first means, for building a 
refined model from an aggregate subset of data that yielded the rough model deemed 
acceptable by the evaluation means; and

means for identifying a cluster of data within the computer readable data 
set based in part on the refined model.

64. (Currently amended) A computer implemented system to facilitate 
building a statistical model for a computer readable data set, comprising: 

first means for estimating model parameters from a subset of the computer 
readable data set; 

means for evaluating the estimated model parameters relative to a holdout 
set of the data set; 

means for determining a data subset from the training data by causing the 
first means and the means for evaluating to respectively repeat estimation and evaluation 
of model parameters with a next successively larger subset of the training data set until an 
acceptable quality of the model parameters is established, the quality of the model
parameters established when the expected cost of generating the next successively larger 
subset outweighs the expected benefit in accuracy of utilizing the next successively larger 
subset; [[and]]

second means for estimating model parameters based on the determined data 
subset to provide a more accurate estimation of model parameters than the first means; and

means for determining a cluster of data contained in the computer readable 
data set based on the more accurate estimation of model parameters.



